Exponential Reservoir Sampling for Streaming Language Models

Authors

  • Miles Osborne
  • Ashwin Lall
  • Benjamin Van Durme

Abstract

We show how rapidly changing textual streams such as Twitter can be modelled in fixed space. Our approach is based upon a randomised algorithm called Exponential Reservoir Sampling, unexplored by this community until now. Using language models over Twitter and Newswire as a testbed, our experimental results based on perplexity support the intuition that recently observed data generally outweighs that seen in the past, but that at times, the past can have valuable signals enabling better modelling of the present.
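
The abstract names the technique but does not spell out the procedure. As a rough illustration only, the sketch below shows one common formulation of exponentially biased reservoir sampling: every arriving item is stored, and once the fixed-size reservoir is full it overwrites a uniformly chosen slot, so an item's chance of surviving decays geometrically with its age. The class name `ExponentialReservoir`, the helper `sample_stream`, and the `capacity` and `seed` parameters are hypothetical names chosen for this example and are not taken from the paper, whose exact variant may differ.

```python
import random


class ExponentialReservoir:
    """Fixed-size sample of a stream, biased toward recent items.

    Assumed formulation (not confirmed by the abstract): each new item is
    always admitted; when the reservoir is full it replaces a uniformly
    random slot. An item that arrived `a` steps ago then survives with
    probability (1 - 1/capacity) ** a.
    """

    def __init__(self, capacity, seed=None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []
        self._rng = random.Random(seed)

    def add(self, item):
        if len(self.items) < self.capacity:
            # Reservoir not yet full: just store the item.
            self.items.append(item)
        else:
            # Reservoir full: evict a uniformly chosen resident.
            self.items[self._rng.randrange(self.capacity)] = item


def sample_stream(stream, capacity, seed=None):
    """Feed every element of `stream` through a reservoir and return it."""
    reservoir = ExponentialReservoir(capacity, seed)
    for item in stream:
        reservoir.add(item)
    return reservoir.items


if __name__ == "__main__":
    # Toy stream of integer "timestamps"; the sample skews toward large values.
    sample = sample_stream(range(1000), capacity=50, seed=0)
    print(sorted(sample))
```

In a language-modelling setting the stored items would be sentences or tweets rather than integers, and a model would be re-estimated from the reservoir contents; that step is omitted here.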

Similar resources

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams

Data intensive applications and computing has emerged as a central area of modern research with the explosion of data stored world-wide. Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of such massively exponential amount of data from diverse...

Stratified Reservoir Sampling over Heterogeneous Data Streams

Reservoir sampling is a well-known technique for random sampling over data streams. In many streaming applications, however, an input stream may be naturally heterogeneous, i.e., composed of substreams whose statistical properties may also vary considerably. For this class of applications, the conventional reservoir sampling technique does not guarantee a statistically sufficient number of tupl...

Approximate Integration of streaming data

We approximate analytic queries on streaming data with a weighted reservoir sampling. For a stream of tuples of a Datawarehouse we show how to approximate some Olap queries. For a stream of graph edges from a Social Network, we approximate the communities as the large connected components of the edges in the reservoir. We show that for a model of random graphs which follow a power law degree di...

Super-Sampling with a Reservoir

We introduce an alternative to reservoir sampling, a classic and popular algorithm for drawing a fixed-size subsample from streaming data in a single pass. Rather than draw a random sample, our approach performs an online optimization which aims to select the subset that provides the best overall approximation to the full data set, as judged using a kernel two-sample test. This produces subsets...

Sublinear Algorithms for MAXCUT and Correlation Clustering

We study sublinear algorithms for two fundamental graph problems, MAXCUT and correlation clustering. Our focus is on constructing core-sets as well as developing streaming algorithms for these problems. Constant space algorithms are known for dense graphs for these problems, while Ω(n) lower bounds exist (in the streaming setting) for sparse graphs. Our goal in this paper is to bridge the gap b...

Publication date: 2014